DiMSum version: 1.2.11
Project name: Bri2_12NNK_03
Run Started: 2023-06-23 17:33:52
Run Completed: 2023-06-23 17:51:13
Command-line arguments:
## runDemo FALSE
## fastqFileDir /users/blehner/bbolognesi/mmartin/Bri2_ADan_ABri/Bri2_NNK/
## fastqFileExtension .fastq
## gzipped TRUE
## stranded TRUE
## paired TRUE
## barcodeErrorRate 0.25
## experimentDesignPath /users/blehner/bbolognesi/mmartin/Bri2_ADan_ABri/Bri2_NNK/experiment_design_03.txt
## experimentDesignPairDuplicates FALSE
## cutadapt5First CAAATTTGCCGTGGAAACTTTAATTTGT
## cutadapt5Second GGTGGCGGCCGCTCTAGATTA
## cutadaptMinLength 36
## cutadaptErrorRate 0.2
## cutadaptOverlap 3
## vsearchMinQual 30
## vsearchMaxee 0.5
## vsearchMinovlen 10
## outputPath /users/blehner/bbolognesi/mmartin/Bri2_ADan_ABri/Bri2_NNK/
## projectName Bri2_12NNK_03
## wildtypeSequence TGGAAGGGGACGATGGTTGTGGGTAGTAATTGGCCG
## permittedSequences NNKNNKNNKNNKNNKNNKNNKNNKNNKNNKNNKNNK
## reverseComplement FALSE
## sequenceType coding
## mutagenesisType codon
## transLibrary FALSE
## transLibraryReverseComplement FALSE
## bayesianDoubleFitness FALSE
## bayesianDoubleFitnessLamD 0.025
## fitnessMinInputCountAll 0
## fitnessMinInputCountAny 100
## fitnessMinOutputCountAll 0
## fitnessMinOutputCountAny 0
## fitnessHighConfidenceCount 10
## fitnessDoubleHighConfidenceCount 50
## fitnessNormalise TRUE
## fitnessErrorModel TRUE
## indels none
## maxSubstitutions 12
## mixedSubstitutions TRUE
## retainIntermediateFiles TRUE
## splitChunkSize 3758096384
## retainedReplicates all
## startStage 1
## stopStage 5
## numCores 10
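The argument listing above can be read back into a dictionary, which is convenient when comparing the settings of several runs. The following is a minimal Python sketch, assuming each entry has the form '## name value' and that a value printed alone on its own '## ' line belongs to the name on the line before it. 'parse_dimsum_args' is a hypothetical helper and not part of DiMSum.

```python
def parse_dimsum_args(lines):
    """Parse '## name value' lines into a dict of strings.

    Assumption: a '## ' line holding a single token that follows a name
    without a value is the wrapped value of that name.
    """
    args = {}
    pending = None  # name that is still waiting for its value
    for raw in lines:
        if not raw.startswith("## "):
            continue  # ignore anything that is not an argument line
        tokens = raw[3:].split(None, 1)
        if not tokens:
            continue
        if pending is not None and len(tokens) == 1:
            # Wrapped value for the previous name
            args[pending] = tokens[0]
            pending = None
        elif len(tokens) == 2:
            args[tokens[0]] = tokens[1].strip()
            pending = None
        else:
            pending = tokens[0]
    return args


report = [
    "## cutadaptErrorRate 0.2",
    "## fitnessMinInputCountAny",
    "## 100",
    "## numCores 10",
]
parsed = parse_dimsum_args(report)
```

All values are kept as strings; converting them to numbers or booleans is left to the caller.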
The DiMSum pipeline consists of five stages grouped into two modules which can be run independently:
Below you will find summary plots with results of each stage corresponding to the module(s) that were run.
DiMSum Stage 1 (QC) summarises base qualities from each raw FastQ file using FastQC.
The plot below shows 10th percentile (upper) and mean (lower) Phred quality scores at the indicated positions in the forward reads (Read 1) in all FastQ files (see legend).
If mean read qualities are low (Phred score < 30) in the constant region sequence, it might be necessary to increase the maximum allowable number of mismatches during trimming in Stage 2 ('cutadaptErrorRate'). If qualities are low in the variable region sequence, it may be necessary to adjust Stage 3 (ALIGN) options ('vsearchMinQual', 'vsearchMaxee'). Be aware that changing these options from their defaults can severely affect the number of 'fake' (spurious) variants that arise from sequencing errors. See DiMSum documentation for details.
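For orientation, a Phred score Q corresponds to a base-calling error probability of 10^(-Q/10), so the Phred 30 level mentioned above means about one expected error per 1000 base calls. The Python sketch below illustrates the conversion. The function names are hypothetical, and the quality string is assumed to use the common Phred+33 ASCII encoding.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to a base-calling error probability."""
    return 10 ** (-q / 10)


def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (assumes Phred+33 encoding)."""
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores)


p30 = phred_to_error_prob(30)  # error probability at the Phred 30 level
p20 = phred_to_error_prob(20)  # ten times higher at Phred 20
```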
The plot below is similar to the one above except quality scores for reverse reads (Read 2) are shown.
DiMSum Stage 2 (TRIM) removes constant region sequences at the start (5') and/or end (3') of each read with Cutadapt if required.
The plot below shows the percentage of forward reads (Read 1) in which the specified constant regions were matched and trimmed (see legend), shown separately for each FastQ file.
Untrimmed reads (or read pairs) are discarded if constant region sequences are specified but not found. Trimmed reads are also discarded if the trimmed sequence is too short ('cutadaptMinLength'). If the percentage of trimmed reads is low, check that constant region sequences were correctly specified ('cutadapt5First', 'cutadapt5Second', 'cutadapt3First', 'cutadapt3Second'). It may also be necessary to increase the maximum allowable number of mismatches ('cutadaptErrorRate') if sequence qualities are low, or to decrease the minimum allowable overlap between read and constant region ('cutadaptOverlap') if constant region sequences are very short (<3 bp). See DiMSum documentation for details.
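To make the effect of 'cutadaptErrorRate' concrete: Cutadapt allows a number of errors equal to the error rate multiplied by the length of the matched adapter portion, rounded down. The Python sketch below applies this to the two constant regions of this run, assuming a full-length match. 'max_allowed_errors' is a hypothetical helper, not a Cutadapt or DiMSum function.

```python
import math


def max_allowed_errors(adapter, error_rate):
    """Errors tolerated for a full-length adapter match (rounded down)."""
    return math.floor(error_rate * len(adapter))


first = "CAAATTTGCCGTGGAAACTTTAATTTGT"  # cutadapt5First from this run
second = "GGTGGCGGCCGCTCTAGATTA"        # cutadapt5Second from this run
error_rate = 0.2                        # cutadaptErrorRate from this run

errors_first = max_allowed_errors(first, error_rate)
errors_second = max_allowed_errors(second, error_rate)
```

Partial matches at the end of a read cover fewer adapter bases and therefore tolerate fewer errors than the full-length values computed here.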
The plot below is similar to the one above except trimming statistics for reverse reads (Read 2) are shown.
DiMSum Stage 3 (ALIGN) aligns paired-end reads using VSEARCH. This stage also filters the resulting variant sequences based on minimum base quality, total number of expected base calling errors and sequence length. If reads are the result of single-end sequencing, these same filters are applied.
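The filters described above can be illustrated with a short Python sketch. It assumes that the expected number of errors is the sum of the per-base error probabilities 10^(-Q/10), and that a read is kept only if every base reaches the minimum quality, the expected-error total stays within the limit, and the sequence is long enough. The function names and the 'min_length' parameter are hypothetical; the exact rules applied by DiMSum and VSEARCH may differ.

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities for one read."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def passes_quality_filters(phred_scores, min_qual=30, max_ee=0.5, min_length=1):
    """Return True if a read passes all three illustrative filters.

    Defaults mirror vsearchMinQual (30) and vsearchMaxee (0.5) of this run;
    min_length is a hypothetical stand-in for the length filter.
    """
    if not phred_scores or len(phred_scores) < min_length:
        return False
    if min(phred_scores) < min_qual:
        return False
    return expected_errors(phred_scores) <= max_ee


good_read = [30] * 36          # every base at exactly Phred 30
bad_read = [40] * 35 + [20]    # a single low-quality base
```

Under these assumptions a 36-base variant whose bases all reach Phred 30 has at most 0.036 expected errors, so the per-base quality check, rather than the 0.5 expected-error limit, would be the filter that removes reads in this run.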
The plot below shows the total percentage of reads (or read pairs) retained for downstream analysis ('vsearch_aligned'), shown separately for each FastQ file. Remaining reads are discarded. Details of each category are as follows:
If the percentage of reads retained is low (well below 50%), the above options may need to be adjusted. See DiMSum documentation for details.
The plot below shows variant sequence length distributions after alignment, shown separately for all samples. The upper quartile, lower quartile and median are shown in each case (see legend). Check that the median sequence length is as expected (e.g. wild-type sequence length without indels).
DiMSum Stage 4 (PROCESS) processes sequences and filters them in order to retain user-specified nucleotide or amino acid substitution variants of interest. The result is a table of variant counts for all samples. Read count diagnostic plots can then be used to rapidly check for the presence of problematic variants (likely the result of sequencing errors) and take steps to remove them (see Sections 4.1 and 4.2 below).
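As an illustration of what retaining user-specified variants can mean for this run, the Python sketch below checks a sequence against the 'permittedSequences' pattern of twelve NNK codons, using the IUPAC codes N (any base) and K (G or T). It is a simplified reading of the option, and 'is_permitted' is a hypothetical helper rather than DiMSum code.

```python
# IUPAC nucleotide codes needed for an NNK pattern
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "K": "GT", "N": "ACGT"}


def is_permitted(seq, pattern):
    """True if every base of seq is allowed by the code at the same position."""
    if len(seq) != len(pattern):
        return False
    return all(base in IUPAC[code] for base, code in zip(seq, pattern))


wildtype = "TGGAAGGGGACGATGGTTGTGGGTAGTAATTGGCCG"  # wildtypeSequence from this run
permitted = "NNK" * 12                              # twelve NNK codons, as in this run

wildtype_ok = is_permitted(wildtype, permitted)
# Replace the first codon TGG with TGA: the third base A is not allowed by K
variant_ok = is_permitted("TGA" + wildtype[3:], permitted)
```

A variant whose codon ends in A or C cannot come from an NNK library, so under this reading it would most likely be a sequencing or synthesis error.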
The plot below shows the percentage of reads retained or discarded in each sample according to the following criteria:
Note: The plots below show read counts before application of user-specified count thresholds.
See DiMSum documentation for more details.
Nucleotide variant statistics (counts). The plot below is similar to the one above, except that the total number of reads (rather than the percentage) in each sample is shown.
Amino acid variant statistics (percentages). The plots below are similar to the ones above, except that amino acid (rather than nucleotide) Hamming distances are shown.
Amino acid variant statistics (counts).
The diagnostic plot below shows marginal variant count distributions separately for all Input samples, first stratified by the number of amino acid substitutions and then stratified by the number of nucleotide substitutions (Hamming distance to the wild-type sequence). Distributions corresponding to Hamming distances greater than 6 are not shown. Wild-type sequence counts are indicated by the black vertical dashed line.
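The two stratifications used above can be sketched in Python as follows: the nucleotide Hamming distance counts differing bases, while the amino acid Hamming distance counts differing residues after translation. The sketch assumes the standard genetic code and sequences of equal length in frame; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate an in-frame nucleotide sequence into amino acids."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


wildtype = "TGGAAGGGGACGATGGTTGTGGGTAGTAATTGGCCG"  # wildtypeSequence from this run

# Synonymous change in codon 6 (GTT -> GTG, both valine)
synonymous = wildtype[:15] + "GTG" + wildtype[18:]
# Codon 1 changed at all three bases (TGG -> GCT, tryptophan -> alanine)
missense = "GCT" + wildtype[3:]

nt_syn = hamming(wildtype, synonymous)
aa_syn = hamming(translate(wildtype), translate(synonymous))
nt_mis = hamming(wildtype, missense)
aa_mis = hamming(translate(wildtype), translate(missense))
```

The two examples show why both stratifications are informative: a variant can differ from the wild type at the nucleotide level without any amino acid change, and a single amino acid substitution can involve up to three nucleotide changes.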
Expected counts from 'fake' variants (due to base-call errors at a rate corresponding to the 'vsearchMinQual' option) are indicated by coloured dashed lines. Bimodal distributions (or unimodal distributions not surpassing the indicated thresholds) indicate variants originating from sequencing errors, likely due to a library 'bottleneck'. A minimum input count threshold should be chosen to remove such variants (see 'fitnessMinInputCountAll' option applied in Stage 5 and DiMSum documentation for more details).
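A rough feel for the size of such 'fake' counts can be obtained from a simple model. The Python sketch below assumes that base-call errors are independent, occur with probability 10^(-Q/10) per base, and are spread evenly over the three alternative bases, so that wild-type reads give rise to one specific variant at Hamming distance d with probability (p/3)^d (1-p)^(L-d). This is a back-of-envelope illustration under stated assumptions, not necessarily the calculation DiMSum uses; the function name and the wild-type count are hypothetical.

```python
def expected_fake_count(wildtype_count, min_qual, seq_length, distance):
    """Expected reads of one specific variant produced by errors in wild-type reads.

    Simplified model: independent errors at rate p = 10^(-Q/10), each error
    equally likely to produce any of the three other bases.
    """
    p = 10 ** (-min_qual / 10)
    per_read = (p / 3) ** distance * (1 - p) ** (seq_length - distance)
    return wildtype_count * per_read


# Hypothetical wild-type count; Q and length taken from this run's settings
single = expected_fake_count(1_000_000, min_qual=30, seq_length=36, distance=1)
double = expected_fake_count(1_000_000, min_qual=30, seq_length=36, distance=2)
```

Under this model the expected count drops steeply with each additional substitution, which is why sequencing errors mainly contaminate variants close to the wild-type sequence.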
Note: The plot below shows variant counts before application of user-specified count thresholds.
The diagnostic plot below is a scatterplot matrix depicting correlations between variant counts from all Input and Output samples. Matrix cells in the upper triangle show Pearson correlation coefficients. Matrix cells in the lower triangle show scatterplot equivalents (hexagonal heatmaps of 2D bin counts). Matrix diagonal cells indicate count densities.
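For reference, the Pearson correlation coefficient shown in the upper triangle can be computed for any pair of samples as in the Python sketch below. The counts are made-up illustrative values, and whether the counts are transformed before the coefficient is calculated is not stated here, so raw counts are assumed.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical variant counts in two Input replicates
input_rep1 = [120, 340, 15, 980, 60]
input_rep2 = [110, 360, 20, 1010, 55]
r = pearson(input_rep1, input_rep2)
```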
Distinct variant populations or 'flaps', i.e. subsets of variants that appear at high counts in one replicate but at low counts in another (and not due to selection), indicate replicate or DNA extraction 'bottlenecks'. Minimum input and/or output count thresholds should be chosen to remove such variants (see 'fitnessMinInputCountAll', 'fitnessMinInputCountAny', 'fitnessMinOutputCountAll' and 'fitnessMinOutputCountAny' options applied in Stage 5 and DiMSum documentation for more details).
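The difference between the 'All' and 'Any' thresholds can be sketched in Python as follows. The sketch assumes that an 'All' threshold must be reached in every replicate and an 'Any' threshold in at least one replicate; this reading should be checked against the DiMSum documentation. The function name and the counts are hypothetical.

```python
def passes_input_thresholds(input_counts, min_all=0, min_any=100):
    """Check one variant's Input counts against both thresholds.

    Defaults mirror fitnessMinInputCountAll (0) and fitnessMinInputCountAny
    (100) of this run. Assumed semantics: 'All' applies to every replicate,
    'Any' to at least one replicate.
    """
    meets_all = all(count >= min_all for count in input_counts)
    meets_any = any(count >= min_any for count in input_counts)
    return meets_all and meets_any


# Hypothetical 'flap' variant: abundant in two replicates, nearly absent in one
flap = [250, 3, 240]

kept_with_run_settings = passes_input_thresholds(flap)              # All=0, Any=100
kept_with_stricter_all = passes_input_thresholds(flap, min_all=10)  # All=10, Any=100
```

Under these assumptions an 'Any' threshold alone does not remove a variant that is well represented in only some replicates, whereas an 'All' threshold does.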
Note: The plot below shows variant counts before application of user-specified count thresholds.